# EnokiQA Dataset

A large-scale dataset for hallucination detection in long-form question answering.

## Files

| File | Records | Size | Description |
|------|---------|------|-------------|
| `enokiqa_test.jsonl` | 1,995 | 215 MB | Annotated test set (balanced: 285 per model) |
| `enokiqa_train.jsonl` | 19,594 | 384 MB | Unannotated training set |

## Overview

Each record is a long-form factual question about a Wikipedia topic, answered by one of seven LLMs in a **no-context** setting (parametric knowledge only). Two verification contexts, the source paragraph and the full Wikipedia article, are provided for each record. The test set also includes automatic hallucination annotations at the claim (atomic fact) and span level.

**Generator models:** Qwen2.5-{7B, 14B, 32B}-Instruct, Qwen3-{4B, 8B}, Llama-3.1-8B-Instruct, Mixtral-8x7B-Instruct

## Schema

### Common fields (train + test)

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier (`{split}_{index}`) |
| `split` | string | `"train"` or `"test"` |
| `title` | string | Wikipedia article title |
| `question` | string | Generated long-form factual question |
| `answer` | string | Model-generated answer (no-context setting) |
| `answer_model` | string | Generator model name |
| `answer_length` | int | Answer length in characters |
| `paragraph_context` | string | Paragraph used for question generation |
| `full_page_context` | string | Full Wikipedia article text |
| `context_id` | string | Hash identifying the paragraph context |
| `wiki_url` | string | Full URL to the Wikipedia article |
| `wiki_pageid` | int | Wikipedia page ID |
| `wiki_categories` | list[string] | Wikipedia categories |
| `wiki_qid` | string | Wikidata QID |
| `wiki_article_length` | int | Article length in characters (from API) |
| `pv_mean` | float | Mean daily pageviews (90-day window) |
| `pv_total` | int | Total pageviews in the 90-day window |
| `pv_p50` | float | Median daily pageviews |
| `pv_p95` | float | 95th percentile daily pageviews |
| `popularity_tier` | string | Popularity tier by daily pageviews: `"low"` (<100), `"medium"` (100-1,000), `"high"` (>1,000) |
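
The tier thresholds can be sketched as a small function. This is an illustration only: the table does not say which pageview statistic the tiers are derived from, or how the exact boundary values 100 and 1,000 are handled. The use of `pv_mean` and the inclusive `medium` range below are assumptions, and `popularity_tier_from_views` is a hypothetical helper name.

```python
def popularity_tier_from_views(views_per_day: float) -> str:
    """Map a daily pageview figure to a tier label.

    Assumption: 100 and 1,000 both fall in "medium", matching the
    "100-1,000" range given in the schema table.
    """
    if views_per_day < 100:
        return "low"
    if views_per_day <= 1000:
        return "medium"
    return "high"


# Made-up record; assumes the tier is derived from `pv_mean`.
record = {"pv_mean": 42.5}
print(popularity_tier_from_views(record["pv_mean"]))
```

When in doubt, prefer the stored `popularity_tier` value over a recomputed one.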

### Annotation fields (test only)

| Field | Type | Description |
|-------|------|-------------|
| `n_facts` | int | Number of extracted atomic facts |
| `n_hallucinated` | int | Number of hallucinated facts (`hall_prob >= 0.5`) |
| `hall_rate` | float | Fraction of hallucinated facts |
| `mean_hall_prob` | float | Mean hallucination probability across facts |
| `max_hall_prob` | float | Maximum hallucination probability |
| `facts` | list[object] | Per-fact annotations (see below) |
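
As a sanity check, the record-level fields can be recomputed from the `facts` list. The sketch below assumes they are plain aggregates over each fact's `hall_prob`. The helper name `summarize_facts` and the handling of an empty fact list are hypothetical, not part of the dataset tooling.

```python
def summarize_facts(facts, threshold=0.5):
    """Recompute record-level annotation fields from per-fact annotations."""
    probs = [fact["hall_prob"] for fact in facts]
    n_facts = len(probs)
    # A fact counts as hallucinated when hall_prob >= threshold.
    n_hallucinated = sum(p >= threshold for p in probs)
    return {
        "n_facts": n_facts,
        "n_hallucinated": n_hallucinated,
        # Assumption: empty fact lists yield 0.0 rather than an error.
        "hall_rate": n_hallucinated / n_facts if n_facts else 0.0,
        "mean_hall_prob": sum(probs) / n_facts if n_facts else 0.0,
        "max_hall_prob": max(probs, default=0.0),
    }


# Made-up facts for illustration only.
toy_facts = [{"hall_prob": 0.1}, {"hall_prob": 0.5}, {"hall_prob": 0.9}]
print(summarize_facts(toy_facts))
```

Comparing this output with the stored fields of a real record is a quick way to confirm the assumption before relying on it.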

### Fact object schema

| Field | Type | Description |
|-------|------|-------------|
| `fact` | string | Extracted atomic fact |
| `span_text` | string | Localized span in the original answer |
| `span_start` | int | Character offset start in answer |
| `span_end` | int | Character offset end in answer |
| `span_kind` | string | `"argument"` or `"predicate"` |
| `entailment` | float | Entailment probability (E) |
| `neutral` | float | Neutral probability (N) |
| `contradiction` | float | Contradiction probability (C) |
| `hall_prob` | float | Hallucination probability (N + C) |
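
The span offsets make it possible to pull the flagged text back out of the answer. The following is a minimal sketch that assumes `span_end` is exclusive (Python slice convention), which the schema does not state. `hallucinated_spans` is a hypothetical helper, and the toy record is made up.

```python
def hallucinated_spans(record, threshold=0.5):
    """Return (start, end, text) for every fact flagged as hallucinated."""
    answer = record["answer"]
    spans = []
    for fact in record["facts"]:
        if fact["hall_prob"] >= threshold:
            start, end = fact["span_start"], fact["span_end"]
            # Assumption: span_end is exclusive, so a plain slice works.
            spans.append((start, end, answer[start:end]))
    return sorted(spans)


# Made-up record for illustration only.
toy = {
    "answer": "Paris is the capital of Spain.",
    "facts": [
        {"span_start": 0, "span_end": 5, "hall_prob": 0.02},
        {"span_start": 24, "span_end": 29, "hall_prob": 0.97},
    ],
}
print(hallucinated_spans(toy))
```

If the sliced text does not match `span_text` on real records, the offsets likely follow a different convention (for example, an inclusive end) and the slice should be adjusted.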

## Key Statistics

|  | Train | Test |
|--|-------|------|
| Examples | 19,594 | 1,995 |
| Unique contexts | 2,226 | 285 |
| Generator models | 7 | 7 (balanced) |
| Avg. answer length (chars) | 5,525 | 5,727 |
| Avg. full page context (chars) | ~13,700 | ~16,500 |
| Avg. facts per answer | --- | 355.4 |
| Mean hallucination rate | --- | 30.5% |

### Popularity distribution

| Tier | Train | Test |
|------|-------|------|
| Low (<100 views/day) | 11,423 | 1,071 |
| Medium (100-1,000) | 7,079 | 798 |
| High (>1,000) | 1,092 | 126 |

## Usage

```python
import json

# Load the test set (JSON Lines: one JSON object per line)
with open("enokiqa_test.jsonl", encoding="utf-8") as f:
    test = [json.loads(line) for line in f]

# Example: get hallucinated facts for a record
rec = test[0]
hallucinated = [f for f in rec["facts"] if f["hall_prob"] >= 0.5]
print(f"{rec['title']}: {len(hallucinated)}/{rec['n_facts']} facts hallucinated")

# Example: filter by model
qwen3_8b = [r for r in test if r["answer_model"] == "Qwen_Qwen3-8B"]
```
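
Because both files are several hundred megabytes, it can be convenient to stream records instead of holding the whole list in memory. The sketch below is one way to do that. `iter_jsonl` and `count_by` are hypothetical helper names, not part of the dataset tooling.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by(records, field):
    """Count records by the value of a top-level field."""
    return Counter(record[field] for record in records)
```

For example, `count_by(iter_jsonl("enokiqa_test.jsonl"), "answer_model")` should report 285 records per generator model if the test set is balanced as described, and the same call with `"popularity_tier"` gives the tier breakdown.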

## Annotation Method

Annotations are produced by an automatic pipeline:

1. **Fact decomposition**: Long-form answers are decomposed into atomic facts using an LLM.
2. **Coreference resolution**: Facts are decontextualized using FastCoref.
3. **NLI verification**: Each fact is checked against the full Wikipedia article using NLI (Qwen3-8B), producing entailment/neutral/contradiction probabilities.
4. **Span localization**: Each fact is mapped back to a character span in the original answer.

Hallucination probability is defined as `hall_prob = neutral + contradiction`.
A fact is considered hallucinated when `hall_prob >= 0.5`.
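
The decision rule can be written out directly from the three NLI probabilities. This is a minimal sketch, and `label_fact` is a hypothetical helper name. If the three probabilities sum to 1 (an assumption, not stated above), the rule is equivalent to flagging a fact whenever its entailment probability is at most 0.5.

```python
def label_fact(entailment, neutral, contradiction, threshold=0.5):
    """Return (hall_prob, is_hallucinated) for one fact.

    `entailment` is unused by the rule itself; it is kept in the
    signature so the call mirrors the three NLI outputs.
    """
    hall_prob = neutral + contradiction
    return hall_prob, hall_prob >= threshold


# Made-up probabilities for illustration only.
print(label_fact(entailment=0.25, neutral=0.5, contradiction=0.25))
```

Raising `threshold` trades recall for precision when selecting hallucinated facts; the stored `n_hallucinated` and `hall_rate` fields use 0.5.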

## License

Wikipedia content is distributed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
